feat: add Amazon Textract integration (#2391) #3148

Open
zafatar wants to merge 7 commits into deepset-ai:main from zafatar:main

@zafatar zafatar commented Apr 13, 2026

Add AmazonTextractConverter component that extracts text from images and single-page PDFs using the AWS Textract synchronous API. Supports both DetectDocumentText (plain OCR) and AnalyzeDocument (tables, forms, signatures, layout) as well as natural-language queries. Includes CI workflow, unit/integration tests, pydoc config, and repo-level wiring (labeler, coverage comment, README).

Related Issues

Proposed Changes:

Like other converter integrations such as Azure Document Intelligence, and other Amazon integrations such as Amazon Bedrock, this component accesses Amazon Textract through boto3 and reads AWS credentials from environment variables.
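
To illustrate the choice between the two synchronous operations, here is a minimal sketch, not the code in this PR. The helper names `build_textract_call` and `run_textract` are hypothetical. The boto3 calls `detect_document_text` and `analyze_document` and the `Document`/`FeatureTypes` parameters are part of the Textract client, and boto3 resolves credentials and region from its usual chain, which includes the `AWS_*` environment variables.

```python
from typing import Any, Optional


def build_textract_call(
    document_bytes: bytes,
    feature_types: Optional[list[str]] = None,
) -> tuple[str, dict[str, Any]]:
    """Pick the synchronous Textract operation and build its keyword arguments.

    Hypothetical helper for illustration; not the converter's actual code.
    """
    request: dict[str, Any] = {"Document": {"Bytes": document_bytes}}
    if not feature_types:
        # Plain OCR: DetectDocumentText takes only the document.
        return "detect_document_text", request
    # Structured analysis: AnalyzeDocument additionally requires FeatureTypes,
    # for example ["TABLES", "FORMS", "SIGNATURES", "LAYOUT"].
    request["FeatureTypes"] = list(feature_types)
    return "analyze_document", request


def run_textract(
    document_bytes: bytes,
    feature_types: Optional[list[str]] = None,
) -> dict[str, Any]:
    """Send the request; boto3 picks up credentials from the environment."""
    import boto3  # imported lazily so the helper above works without boto3

    client = boto3.client("textract")
    method_name, request = build_textract_call(document_bytes, feature_types)
    return getattr(client, method_name)(**request)
```

Separating request construction from the network call keeps the branching logic testable without AWS access, which is one way to back the unit/integration split described below.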

How did you test it?

The tests are run as two separate groups:

cd ./integrations/amazon_textract
hatch run test:unit
hatch run test:integration

Notes for the reviewer

Checklist

@zafatar zafatar requested a review from a team as a code owner April 13, 2026 12:37
@zafatar zafatar requested review from bogdankostic and removed request for a team April 13, 2026 12:37
@github-actions github-actions bot added topic:CI type:documentation Improvements or additions to documentation labels Apr 13, 2026

CLAassistant commented Apr 13, 2026

CLA assistant check
All committers have signed the CLA.

Contributor

@bogdankostic bogdankostic left a comment

Thanks for your PR @zafatar! It already looks good to me. I left a few minor comments on how it can be improved further.

When provided, the Textract ``QUERIES`` feature type is enabled
automatically and each question is sent as a query. Answers are
included in the raw Textract response. Example:
``["What is the patient name?", "What is the total due?"]``

Let's use single backticks here.

Suggested change
``["What is the patient name?", "What is the total due?"]``
`["What is the patient name?", "What is the total due?"]`
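
For context on the docstring under review, the following sketch shows how a list of questions could map onto an AnalyzeDocument request and how answers could be read back from the raw response. It is an assumption-laden illustration: `add_queries` and `extract_answers` are hypothetical names, not functions from this PR. The `QueriesConfig` parameter and the `QUERY` / `QUERY_RESULT` block types linked by `ANSWER` relationships come from the Textract API.

```python
from typing import Any


def add_queries(request: dict[str, Any], queries: list[str]) -> dict[str, Any]:
    """Return a copy of an AnalyzeDocument request with the questions attached."""
    updated = dict(request)
    features = list(updated.get("FeatureTypes", []))
    if "QUERIES" not in features:
        # Enable the QUERIES feature type automatically, as the docstring describes.
        features.append("QUERIES")
    updated["FeatureTypes"] = features
    updated["QueriesConfig"] = {"Queries": [{"Text": q} for q in queries]}
    return updated


def extract_answers(response: dict[str, Any]) -> dict[str, list[str]]:
    """Map each question to its answer texts by following ANSWER relationships."""
    blocks = {block["Id"]: block for block in response.get("Blocks", [])}
    answers: dict[str, list[str]] = {}
    for block in blocks.values():
        if block.get("BlockType") != "QUERY":
            continue
        answer_ids = [
            block_id
            for relationship in block.get("Relationships", [])
            if relationship.get("Type") == "ANSWER"
            for block_id in relationship.get("Ids", [])
        ]
        # A question with no ANSWER relationship maps to an empty list.
        answers[block["Query"]["Text"]] = [
            blocks[block_id]["Text"] for block_id in answer_ids if block_id in blocks
        ]
    return answers
```

Calling `add_queries(request, ["What is the patient name?", "What is the total due?"])` would produce a request suitable for `analyze_document`, and `extract_answers` could then be applied to the response it returns.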

This file will be generated automatically once we do the release, so we can remove it here.

Let's add some checks that warning messages are raised for cases that go wrong.
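
One possible shape for such a check, sketched with the standard library only. `convert_sources` is a hypothetical stand-in for the converter's run loop, and the logger name and warning text are assumptions, not taken from this PR. In the repository's pytest suite, the `caplog` fixture would likely play the role that `assertLogs` plays here.

```python
import logging
import unittest

logger = logging.getLogger("textract_converter_sketch")


def convert_sources(sources, read_bytes):
    """Hypothetical stand-in: skip unreadable sources and log a warning."""
    results = []
    for source in sources:
        try:
            results.append(read_bytes(source))
        except Exception as error:
            logger.warning("Could not read %s. Skipping it. Error: %s", source, error)
    return results


class TestWarnings(unittest.TestCase):
    def test_unreadable_source_logs_warning(self):
        def failing_reader(source):
            raise FileNotFoundError(source)

        # assertLogs fails the test if no WARNING (or higher) record is emitted.
        with self.assertLogs(logger, level="WARNING") as captured:
            results = convert_sources(["missing.png"], failing_reader)

        # The bad source is skipped rather than aborting the whole run.
        self.assertEqual(results, [])
        self.assertIn("Could not read missing.png", captured.output[0])
```

The same pattern extends to other failure cases, such as a Textract call that raises, by swapping in a reader or client stub that fails in the relevant way.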

Labels

integration:amazon-textract topic:CI type:documentation Improvements or additions to documentation

3 participants